Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams
Authors
Abstract
Data-intensive applications and computing have emerged as a central area of modern research with the explosion of data stored worldwide. Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., require efficient management and processing of massive volumes of data from diverse sources. Detecting duplicates and removing redundancy from such multibillion-record datasets saves storage and compute resources in downstream processing. De-duplication, or intelligent compression, in streaming scenarios is a greater challenge: duplicates must be identified approximately and eliminated from an unbounded data stream as the data arrives in real time. Stable Bloom Filters (SBF) address this problem to a certain extent. However, SBF suffers from a high false negative rate and a slow convergence rate, which makes it inefficient for applications with a low tolerance for false negatives.
In this work, we present several novel algorithms for approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF), which combines the working principles of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF), built on biased sampling concepts; different update and biasing mechanisms yield variants of the same model that let the data structure adapt to various input scenarios. We further propose a randomized, load-balanced variant of the sampling Bloom Filter approach to tackle duplicate detection efficiently. Together, these provide a generic framework for de-duplication using Bloom Filters. Through detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate, and convergence rate of the proposed structures. We show that our models clearly outperform the existing methods. We also present an empirical analysis of the structures on real-world datasets (3 million records) and on synthetic datasets (1 billion records) capturing various input distributions.
∗This work was completed at IBM Research, India. Authors: Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee. Preprint submitted to Information Systems, Elsevier, January 9, 2014. arXiv:1212.3964v1 [cs.IR], 17 Dec 2012.
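The abstract names the ingredients of RSBF (reservoir sampling plus Bloom Filters) but does not spell out the algorithm. The following Python sketch illustrates how the two ideas might be combined and should not be read as the authors' method. Everything in it is an assumption made for illustration: the class name `SampledBloomDeduper`, the parameters `s` (bits per array) and `k` (number of arrays), the salted SHA-256 hashing, the acceptance probability `min(1, s/i)` borrowed from classic reservoir sampling, and the eviction rule that clears one random bit per array once more than `s` items have been seen.

```python
import hashlib
import random


class SampledBloomDeduper:
    """Illustrative sketch only: k bit arrays of s bits with reservoir-style insertion."""

    def __init__(self, s=1024, k=3, seed=0):
        self.s = s                      # bits per array (assumed parameter)
        self.k = k                      # number of arrays / hash functions (assumed)
        self.bits = [bytearray(s) for _ in range(k)]  # one byte per bit, for clarity
        self.seen = 0                   # number of stream items processed so far
        self.rng = random.Random(seed)

    def _positions(self, item):
        # One position per array, derived from k salted SHA-256 digests.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.s

    def is_duplicate(self, item):
        """Return True if the item is reported as a (possible) duplicate."""
        self.seen += 1
        positions = list(self._positions(item))
        if all(self.bits[j][p] for j, p in enumerate(positions)):
            return True  # may be a false positive
        # Reservoir-style acceptance: always for the first s items, then s/i.
        if self.rng.random() < min(1.0, self.s / self.seen):
            if self.seen > self.s:
                # Assumed eviction rule: clear one random bit in each array.
                for j in range(self.k):
                    self.bits[j][self.rng.randrange(self.s)] = 0
            for j, p in enumerate(positions):
                self.bits[j][p] = 1
        return False  # may be a false negative if the item was evicted or skipped


# Usage: the repeated "a" arrives while every item is still inserted.
dedup = SampledBloomDeduper(s=1024, k=3)
flags = [dedup.is_duplicate(x) for x in ["a", "b", "a"]]
```

In this sketch, clearing bits keeps the arrays from saturating on an unbounded stream, at the cost of false negatives; sampling insertions with decreasing probability limits how often bits are cleared. How the paper actually balances these effects is given by its own analysis, not by this code.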
Similar resources
Streaming Quotient Filter: A Near Optimal Approximate Duplicate Detection Approach for Data Streams
The unparalleled growth and popularity of the Internet coupled with the advent of diverse modern applications such as search engines, on-line transactions, climate warning systems, etc., has catered to an unprecedented expanse in the volume of data stored world-wide. Efficient storage, management, and processing of such massively exponential amount of data has emerged as a central theme of rese...
An Approximate Duplicate-Elimination in RFID Data Streams Based on d-Left Time Bloom Filter
Article history: Received 6 March 2010 Received in revised form 16 July 2011 Accepted 18 July 2011 Available online 31 July 2011 The RFID technology has been applied to a wide range of areas since it does not require contact in detecting RFID tags. However, due to the multiple readings in many cases in detecting an RFID tag and the deployment of multiple readers, RFID data contains many duplica...
Privacy preserving record linkage using homomorphic encryption
The bloom filter method for privacy preserving record linkage [24] has been shown to be both efficient, and provide equivalent linkage quality to that achievable with unencoded identifiers [23]. However in some situations, the bloom filter method may be vulnerable to frequency attacks, which could potentially leak identifying information [18]. In this paper we extend the bloom filter protocol t...
Probabilistic Counting with Randomized Storage
Previous work by Talbot and Osborne [2007] explored the use of randomized storage mechanisms in language modeling. These structures trade a small amount of error for significant space savings, enabling the use of larger language models on relatively modest hardware. Going beyond space efficient count storage, here we present the Talbot Osborne Morris Bloom (TOMB) Counter, an extended model for ...
An Efficient Similarity Digests Database Lookup - A Logarithmic Divide & Conquer Approach
Investigating seized devices within digital forensics represents a challenging task due to the increasing amount of data. Common procedures utilize automated file identification, which reduces the amount of data an investigator has to examine manually. In the past years the research field of approximate matching arises to detect similar data. However, if n denotes the number of similarity diges...
Journal:
- CoRR
Volume abs/1212.3964, Issue
Pages -
Publication date: 2012